{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Hyper-parameter tuning\n",
    "\n",
    "**Learning Objectives**\n",
    "1. Understand various approaches to hyperparameter tuning\n",
    "2. Automate hyperparameter tuning using AI Platform\n",
    "\n",
    "## Introduction\n",
    "\n",
    "In the previous notebook we achieved an RMSE of **4.13**. Let's see if we can improve upon that by tuning our hyperparameters.\n",
    "\n",
    "Hyperparameters are parameters that are set *prior* to training a model, as opposed to parameters which are learned *during* training. \n",
    "\n",
    "These include learning rate and batch size, but also model design parameters such as type of activation function and number of hidden units.\n",
    "\n",
    "Here are the four most common ways to finding the ideal hyperparameters:\n",
    "1. Manual\n",
    "2. Grid Search\n",
    "3. Random Search\n",
    "4. Bayesian Optimzation\n",
    "\n",
    "**1. Manual**\n",
    "\n",
    "Traditionally, hyperparameter tuning is a manual trial and error process. A data scientist has some intution about suitable hyperparameters which they use as a starting point, then they observe the result and use that information to try a new set of hyperparameters to try to beat the existing performance. \n",
    "\n",
    "Pros\n",
    "- Educational, builds up your intuition as a data scientist\n",
    "- Inexpensive because only one trial is conducted at a time\n",
    "\n",
    "Cons\n",
    "- Requires alot of time and patience\n",
    "\n",
    "**2. Grid Search**\n",
    "\n",
    "On the other extreme we can use grid search. Define a discrete set of values to try for each hyperparameter then try every possible combination. \n",
    "\n",
    "Pros\n",
    "- Can run hundreds of trials in parallel using the cloud\n",
    "- Gauranteed to find the best solution within the search space\n",
    "\n",
    "Cons\n",
    "- Expensive\n",
    "\n",
    "**3. Random Search**\n",
    "\n",
    "Alternatively define a range for each hyperparamter (e.g. 0-256) and sample uniformly at random from that range. \n",
    "\n",
    "Pros\n",
    "- Can run hundreds of trials in parallel using the cloud\n",
    "- Requires less trials than Grid Search to find a good solution\n",
    "\n",
    "Cons\n",
    "- Expensive (but less so than Grid Search)\n",
    "\n",
    "**4. Bayesian Optimization**\n",
    "\n",
    "Unlike Grid Search and Random Search, Bayesian Optimization takes into account information from  past trials to select parameters for future trials. The details of how this is done is beyond the scope of this notebook, but if you're interested you can read how it works here [here](https://cloud.google.com/blog/products/gcp/hyperparameter-tuning-cloud-machine-learning-engine-using-bayesian-optimization). \n",
    "\n",
    "Pros\n",
    "- Picks values intelligenty based on results from past trials\n",
    "- Less expensive because requires fewer trials to get a good result\n",
    "\n",
    "Cons\n",
    "- Requires sequential trials for best results, takes longer\n",
    "\n",
    "**AI Platform HyperTune**\n",
    "\n",
    "AI Platform HyperTune, powered by [Google Vizier](https://ai.google/research/pubs/pub46180), uses Bayesian Optimization by default, but [also supports](https://cloud.google.com/ml-engine/docs/tensorflow/hyperparameter-tuning-overview#search_algorithms) Grid Search and Random Search. \n",
    "\n",
    "\n",
    "When tuning just a few hyperparameters (say less than 4), Grid Search and Random Search work well, but when tunining several hyperparameters and the search space is large Bayesian Optimization is best."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "PROJECT = \"cloud-training-demos\"  # Replace with your PROJECT\n",
    "BUCKET = \"cloud-training-bucket\"  # Replace with your BUCKET\n",
    "REGION = \"us-central1\"            # Choose an available region for AI Platform\n",
    "TFVERSION = \"1.14\"                # TF version for AI Platform to use"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os \n",
    "os.environ[\"PROJECT\"] = PROJECT\n",
    "os.environ[\"BUCKET\"] = BUCKET\n",
    "os.environ[\"REGION\"] = REGION\n",
    "os.environ[\"TFVERSION\"] = TFVERSION "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Move code into python package\n",
    "\n",
    "Let's package our updated code with feature engineering so it's AI Platform compatible."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "mkdir taxifaremodel\n",
    "touch taxifaremodel/__init__.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create model.py\n",
    "\n",
    "Note that any hyperparameters we want to tune need to be exposed as command line arguments. In particular note that the number of hidden units is now a parameter."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%writefile taxifaremodel/model.py\n",
    "import tensorflow as tf\n",
    "import numpy as np\n",
    "import shutil\n",
    "print(tf.__version__)\n",
    "\n",
    "#1. Train and Evaluate Input Functions\n",
    "CSV_COLUMN_NAMES = [\"fare_amount\",\"dayofweek\",\"hourofday\",\"pickuplon\",\"pickuplat\",\"dropofflon\",\"dropofflat\"]\n",
    "CSV_DEFAULTS = [[0.0],[1],[0],[-74.0],[40.0],[-74.0],[40.7]]\n",
    "\n",
    "def read_dataset(csv_path):\n",
    "    def _parse_row(row):\n",
    "        # Decode the CSV row into list of TF tensors\n",
    "        fields = tf.decode_csv(records = row, record_defaults = CSV_DEFAULTS)\n",
    "\n",
    "        # Pack the result into a dictionary\n",
    "        features = dict(zip(CSV_COLUMN_NAMES, fields))\n",
    "        \n",
    "        # NEW: Add engineered features\n",
    "        features = add_engineered_features(features)\n",
    "        \n",
    "        # Separate the label from the features\n",
    "        label = features.pop(\"fare_amount\") # remove label from features and store\n",
    "\n",
    "        return features, label\n",
    "    \n",
    "    # Create a dataset containing the text lines.\n",
    "    dataset = tf.data.Dataset.list_files(file_pattern = csv_path) # (i.e. data_file_*.csv)\n",
    "    dataset = dataset.flat_map(map_func = lambda filename:tf.data.TextLineDataset(filenames = filename).skip(count = 1))\n",
    "\n",
    "    # Parse each CSV row into correct (features,label) format for Estimator API\n",
    "    dataset = dataset.map(map_func = _parse_row)\n",
    "    \n",
    "    return dataset\n",
    "\n",
    "def train_input_fn(csv_path, batch_size = 128):\n",
    "    #1. Convert CSV into tf.data.Dataset with (features,label) format\n",
    "    dataset = read_dataset(csv_path)\n",
    "      \n",
    "    #2. Shuffle, repeat, and batch the examples.\n",
    "    dataset = dataset.shuffle(buffer_size = 1000).repeat(count = None).batch(batch_size = batch_size)\n",
    "   \n",
    "    return dataset\n",
    "\n",
    "def eval_input_fn(csv_path, batch_size = 128):\n",
    "    #1. Convert CSV into tf.data.Dataset with (features,label) format\n",
    "    dataset = read_dataset(csv_path)\n",
    "\n",
    "    #2.Batch the examples.\n",
    "    dataset = dataset.batch(batch_size = batch_size)\n",
    "   \n",
    "    return dataset\n",
    "  \n",
    "#2. Feature Engineering\n",
    "# One hot encode dayofweek and hourofday\n",
    "fc_dayofweek = tf.feature_column.categorical_column_with_identity(key = \"dayofweek\", num_buckets = 7)\n",
    "fc_hourofday = tf.feature_column.categorical_column_with_identity(key = \"hourofday\", num_buckets = 24)\n",
    "\n",
    "# Cross features to get combination of day and hour\n",
    "fc_day_hr = tf.feature_column.crossed_column(keys = [fc_dayofweek, fc_hourofday], hash_bucket_size = 24 * 7)\n",
    "\n",
    "# Bucketize latitudes and longitudes\n",
    "NBUCKETS = 16\n",
    "latbuckets = np.linspace(start = 38.0, stop = 42.0, num = NBUCKETS).tolist()\n",
    "lonbuckets = np.linspace(start = -76.0, stop = -72.0, num = NBUCKETS).tolist()\n",
    "fc_bucketized_plat = tf.feature_column.bucketized_column(source_column = tf.feature_column.numeric_column(key = \"pickuplon\"), boundaries = lonbuckets)\n",
    "fc_bucketized_plon = tf.feature_column.bucketized_column(source_column = tf.feature_column.numeric_column(key = \"pickuplat\"), boundaries = latbuckets)\n",
    "fc_bucketized_dlat = tf.feature_column.bucketized_column(source_column = tf.feature_column.numeric_column(key = \"dropofflon\"), boundaries = lonbuckets)\n",
    "fc_bucketized_dlon = tf.feature_column.bucketized_column(source_column = tf.feature_column.numeric_column(key = \"dropofflat\"), boundaries = latbuckets)\n",
    "\n",
    "def add_engineered_features(features):\n",
    "    features[\"dayofweek\"] = features[\"dayofweek\"] - 1 # subtract one since our days of week are 1-7 instead of 0-6\n",
    "    \n",
    "    features[\"latdiff\"] = features[\"pickuplat\"] - features[\"dropofflat\"] # East/West\n",
    "    features[\"londiff\"] = features[\"pickuplon\"] - features[\"dropofflon\"] # North/South\n",
    "    features[\"euclidean_dist\"] = tf.sqrt(features[\"latdiff\"]**2 + features[\"londiff\"]**2)\n",
    "\n",
    "    return features\n",
    "\n",
    "feature_cols = [\n",
    "  #1. Engineered using tf.feature_column module\n",
    "  tf.feature_column.indicator_column(categorical_column = fc_day_hr),\n",
    "  fc_bucketized_plat,\n",
    "  fc_bucketized_plon,\n",
    "  fc_bucketized_dlat,\n",
    "  fc_bucketized_dlon,\n",
    "  #2. Engineered in input functions\n",
    "  tf.feature_column.numeric_column(key = \"latdiff\"),\n",
    "  tf.feature_column.numeric_column(key = \"londiff\"),\n",
    "  tf.feature_column.numeric_column(key = \"euclidean_dist\") \n",
    "]\n",
    "\n",
    "#3. Serving Input Receiver Function\n",
    "def serving_input_receiver_fn():\n",
    "    receiver_tensors = {\n",
    "        'dayofweek' : tf.placeholder(dtype = tf.int32, shape = [None]), # shape is vector to allow batch of requests\n",
    "        'hourofday' : tf.placeholder(dtype = tf.int32, shape = [None]),\n",
    "        'pickuplon' : tf.placeholder(dtype = tf.float32, shape = [None]), \n",
    "        'pickuplat' : tf.placeholder(dtype = tf.float32, shape = [None]),\n",
    "        'dropofflat' : tf.placeholder(dtype = tf.float32, shape = [None]),\n",
    "        'dropofflon' : tf.placeholder(dtype = tf.float32, shape = [None]),\n",
    "    }\n",
    "    \n",
    "    features = add_engineered_features(receiver_tensors) # 'features' is what is passed on to the model\n",
    "    \n",
    "    return tf.estimator.export.ServingInputReceiver(features = features, receiver_tensors = receiver_tensors)\n",
    "  \n",
    "#4. Train and Evaluate\n",
    "def train_and_evaluate(params):\n",
    "    OUTDIR = params[\"output_dir\"]\n",
    "\n",
    "    model = tf.estimator.DNNRegressor(\n",
    "        hidden_units = params[\"hidden_units\"].split(\",\"), # NEW: paramaterize architecture\n",
    "        feature_columns = feature_cols, \n",
    "        model_dir = OUTDIR,\n",
    "        config = tf.estimator.RunConfig(\n",
    "            tf_random_seed = 1, # for reproducibility\n",
    "            save_checkpoints_steps = max(100, params[\"train_steps\"] // 10) # checkpoint every N steps\n",
    "        ) \n",
    "    )\n",
    "\n",
    "    # Add custom evaluation metric\n",
    "    def my_rmse(labels, predictions):\n",
    "        pred_values = tf.squeeze(input = predictions[\"predictions\"], axis = -1)\n",
    "        return {\"rmse\": tf.metrics.root_mean_squared_error(labels = labels, predictions = pred_values)}\n",
    "    \n",
    "    model = tf.contrib.estimator.add_metrics(model, my_rmse)  \n",
    "\n",
    "    train_spec = tf.estimator.TrainSpec(\n",
    "        input_fn = lambda: train_input_fn(params[\"train_data_path\"]),\n",
    "        max_steps = params[\"train_steps\"])\n",
    "\n",
    "    exporter = tf.estimator.FinalExporter(name = \"exporter\", serving_input_receiver_fn = serving_input_receiver_fn) # export SavedModel once at the end of training\n",
    "    # Note: alternatively use tf.estimator.BestExporter to export at every checkpoint that has lower loss than the previous checkpoint\n",
    "\n",
    "\n",
    "    eval_spec = tf.estimator.EvalSpec(\n",
    "        input_fn = lambda: eval_input_fn(params[\"eval_data_path\"]),\n",
    "        steps = None,\n",
    "        start_delay_secs = 1, # wait at least N seconds before first evaluation (default 120)\n",
    "        throttle_secs = 1, # wait at least N seconds before each subsequent evaluation (default 600)\n",
    "        exporters = exporter) # export SavedModel once at the end of training\n",
    "\n",
    "    tf.logging.set_verbosity(v = tf.logging.INFO) # so loss is printed during training\n",
    "    shutil.rmtree(path = OUTDIR, ignore_errors = True) # start fresh each time\n",
    "\n",
    "    tf.estimator.train_and_evaluate(model, train_spec, eval_spec)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create task.py\n",
    "\n",
    "When doing hyperparameter tuning we need to make sure the output directory is different for each run, otherwise successive runs will overwrite previous runs. \n",
    "\n",
    "One way to do this is to append the trial id. This part of code can be removed if you are not using hyperparameter tuning"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%writefile taxifaremodel/task.py\n",
    "import argparse\n",
    "import json\n",
    "import os\n",
    "\n",
    "from . import model\n",
    "\n",
    "if __name__ == \"__main__\":\n",
    "    parser = argparse.ArgumentParser()\n",
    "    parser.add_argument(\n",
    "        \"--hidden_units\",\n",
    "        help = \"Hidden layer sizes to use for DNN feature columns -- provide space-separated layers\",\n",
    "        type = str,\n",
    "        default = \"10,10\"\n",
    "    )\n",
    "    parser.add_argument(\n",
    "        \"--train_data_path\",\n",
    "        help = \"GCS or local path to training data\",\n",
    "        required = True\n",
    "    )\n",
    "    parser.add_argument(\n",
    "        \"--train_steps\",\n",
    "        help = \"Steps to run the training job for (default: 1000)\",\n",
    "        type = int,\n",
    "        default = 1000\n",
    "    )\n",
    "    parser.add_argument(\n",
    "        \"--eval_data_path\",\n",
    "        help = \"GCS or local path to evaluation data\",\n",
    "        required = True\n",
    "    )\n",
    "    parser.add_argument(\n",
    "        \"--output_dir\",\n",
    "        help = \"GCS location to write checkpoints and export models\",\n",
    "        required = True\n",
    "    )\n",
    "    parser.add_argument(\n",
    "        \"--job-dir\",\n",
    "        help=\"This is not used by our model, but it is required by gcloud\",\n",
    "    )\n",
    "    args = parser.parse_args().__dict__\n",
    "    \n",
    "    # Append trial_id to path so trials don\"t overwrite each other\n",
    "    args[\"output_dir\"] = os.path.join(\n",
    "        args[\"output_dir\"],\n",
    "        json.loads(\n",
    "            os.environ.get(\"TF_CONFIG\", \"{}\")\n",
    "        ).get(\"task\", {}).get(\"trial\", \"\")\n",
    "    ) \n",
    "        \n",
    "    # Run the training job\n",
    "    model.train_and_evaluate(args)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create hypertuning configuration \n",
    "\n",
    "We specify:\n",
    "1. How many trials to run (`maxTrials`) and how many of those trials can be run in parrallel (`maxParallelTrials`) \n",
    "2. Which algorithm to use (in this case `GRID_SEARCH`)\n",
    "3. Which metric to optimize (`hyperparameterMetricTag`)\n",
    "4. The search region in which to constrain the hyperparameter search\n",
    "\n",
    "Full specification options [here](https://cloud.google.com/ml-engine/reference/rest/v1/projects.jobs#HyperparameterSpec).\n",
    "\n",
    "Here we are just tuning one parameter, the number of hidden units, and we'll run all trials in parrallel. However more commonly you would tune multiple hyperparameters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%writefile hyperparam.yaml\n",
    "trainingInput:\n",
    "  scaleTier: BASIC\n",
    "  hyperparameters:\n",
    "    goal: MINIMIZE\n",
    "    maxTrials: 10\n",
    "    maxParallelTrials: 10\n",
    "    hyperparameterMetricTag: rmse\n",
    "    enableTrialEarlyStopping: True\n",
    "    algorithm: GRID_SEARCH\n",
    "    params:\n",
    "    - parameterName: hidden_units\n",
    "      type: CATEGORICAL\n",
    "      categoricalValues:\n",
    "      - 10,10\n",
    "      - 64,32\n",
    "      - 128,64,32\n",
    "      - 32,64,128\n",
    "      - 128,128,128\n",
    "      - 32,32,32\n",
    "      - 256,128,64,32\n",
    "      - 256,256,256,32\n",
    "      - 256,256,256,256\n",
    "      - 512,256,128,64,32"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Run the training job ##\n",
    "\n",
    "Same as before with the addition of `--config=hyperpam.yaml` to reference the file we just created.\n",
    "\n",
    "This will take about 20 minutes. Go to [cloud console](https://console.cloud.google.com/mlengine/jobs) and click on the job id. Once the job is completed, the choosen hyperparameters and resulting objective value (RMSE in this case) will be shown. Trials will sorted from best to worst. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "OUTDIR=\"gs://{}/taxifare/trained_hp_tune\".format(BUCKET)\n",
    "!gcloud storage rm --recursive --continue-on-error {OUTDIR} # start fresh each time\n",    "!gcloud ai-platform jobs submit training taxifare_$(date -u +%y%m%d_%H%M%S) \\\n",
    "    --package-path=taxifaremodel \\\n",
    "    --module-name=taxifaremodel.task \\\n",
    "    --config=hyperparam.yaml \\\n",
    "    --job-dir=gs://{BUCKET}/taxifare \\\n",
    "    --python-version=3.5 \\\n",
    "    --runtime-version={TFVERSION} \\\n",
    "    --region={REGION} \\\n",
    "    -- \\\n",
    "    --train_data_path=gs://{BUCKET}/taxifare/smallinput/taxi-train.csv \\\n",
    "    --eval_data_path=gs://{BUCKET}/taxifare/smallinput/taxi-valid.csv  \\\n",
    "    --train_steps=5000 \\\n",
    "    --output_dir={OUTDIR}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Results\n",
    "\n",
    "The best result is RMSE **4.02** with hidden units = 128,64,32. \n",
    "\n",
    "This improvement is modest, but now that we have our hidden units tuned let's run on our larger dataset to see if it helps. \n",
    "\n",
    "Note the passing of hyperparameter values via command line"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "OUTDIR=\"gs://{}/taxifare/trained_large_tuned\".format(BUCKET)\n",
    "!gcloud storage rm --recursive --continue-on-error {OUTDIR} # start fresh each time\n",    "!gcloud ai-platform jobs submit training taxifare_large_$(date -u +%y%m%d_%H%M%S) \\\n",
    "    --package-path=taxifaremodel \\\n",
    "    --module-name=taxifaremodel.task \\\n",
    "    --job-dir=gs://{BUCKET}/taxifare \\\n",
    "    --python-version=3.5 \\\n",
    "    --runtime-version={TFVERSION} \\\n",
    "    --region={REGION} \\\n",
    "    --scale-tier=STANDARD_1 \\\n",
    "    -- \\\n",
    "    --train_data_path=gs://cloud-training-demos/taxifare/large/taxi-train*.csv \\\n",
    "    --eval_data_path=gs://cloud-training-demos/taxifare/small/taxi-valid.csv  \\\n",
    "    --train_steps=200000 \\\n",
    "    --output_dir={OUTDIR} \\\n",
    "    --hidden_units=\"128,64,32\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Analysis\n",
    "\n",
    "Our RMSE improved to **3.85**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Challenge excercise\n",
    "\n",
    "Try to beat the current RMSE:\n",
    "\n",
    "- Try adding more engineered features or modifying existing ones\n",
    "- Try tuning additional hyperparameters "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Copyright 2019 Google Inc. Licensed under the Apache License, Version 2.0 (the \"License\"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
